Credit card fraud is the unauthorized use of a credit or debit card to make purchases. Everyone along the payment lifecycle pays the price for a fraudulent transaction, from the consumer who makes purchases in person or online to the merchant who finalizes them, and the costs can be staggering: global financial losses related to payment cards were estimated to reach $34.66 billion in 2022.
Credit card companies have an obligation to protect their customers' finances, so they employ fraud detection models to identify unusual financial activity and freeze a user's credit card if transaction activity is out of the ordinary for that individual. The penalty for mislabeling a fraudulent transaction as legitimate is having a user's money stolen, which the credit card company typically reimburses. The penalty for mislabeling a legitimate transaction as fraud is freezing the user out of their finances and leaving them unable to make payments. There is a delicate tradeoff between these two consequences, and we will discuss how to handle it when training a model.
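In practice this tradeoff is managed by choosing a decision threshold on the model's predicted fraud probability. A minimal sketch with made-up probabilities (all numbers here are illustrative, not from the dataset) shows how moving the threshold trades missed fraud against frozen legitimate users:

```python
import numpy as np

# Hypothetical predicted fraud probabilities and true labels,
# purely to illustrate the threshold tradeoff.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_prob = np.array([0.02, 0.05, 0.10, 0.15, 0.20, 0.30, 0.55, 0.40, 0.70, 0.90])

def confusion_counts(threshold):
    """Count false negatives (missed fraud) and false positives
    (legitimate transactions flagged) at a given threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    return fn, fp

# A low threshold catches more fraud (fewer FN) but freezes more
# legitimate users (more FP); a high threshold does the opposite.
print(confusion_counts(0.25))  # (0, 2)
print(confusion_counts(0.60))  # (1, 0)
```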

The dataset is from Kaggle. It is a simulated credit card transaction dataset containing legitimate and fraudulent transactions covering the period 1 Jan 2019 - 31 Dec 2020. It covers the credit cards of 1,000 customers transacting with a pool of 800 merchants. (Thanks to Brandon Harris for his amazing work in creating this easy-to-use simulation tool for generating fraud transaction datasets.) There are 23 columns and 1,296,675 rows.

pip install plotly
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression # For Logistic Regression Model
from sklearn.tree import DecisionTreeClassifier # For Decision Tree Classification Model
from sklearn.ensemble import RandomForestClassifier # For Random Forest Classification Model
from sklearn.model_selection import GridSearchCV # For hyperparameters tuning
from sklearn.preprocessing import LabelEncoder # For converting categorical variables to numerical variables
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score,roc_auc_score
import plotly.express as px
fraud = pd.read_csv('fraud.csv')
fraud.head()
| | Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | ... | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2019-01-01 00:00:18 | 2703186189652095 | fraud_Rippin, Kub and Mann | misc_net | 4.97 | Jennifer | Banks | F | 561 Perry Cove | ... | 36.0788 | -81.1781 | 3495 | Psychologist, counselling | 1988-03-09 | 0b242abb623afc578575680df30655b9 | 1325376018 | 36.011293 | -82.048315 | 0 |
| 1 | 1 | 2019-01-01 00:00:44 | 630423337322 | fraud_Heller, Gutmann and Zieme | grocery_pos | 107.23 | Stephanie | Gill | F | 43039 Riley Greens Suite 393 | ... | 48.8878 | -118.2105 | 149 | Special educational needs teacher | 1978-06-21 | 1f76529f8574734946361c461b024d99 | 1325376044 | 49.159047 | -118.186462 | 0 |
| 2 | 2 | 2019-01-01 00:00:51 | 38859492057661 | fraud_Lind-Buckridge | entertainment | 220.11 | Edward | Sanchez | M | 594 White Dale Suite 530 | ... | 42.1808 | -112.2620 | 4154 | Nature conservation officer | 1962-01-19 | a1a22d70485983eac12b5b88dad1cf95 | 1325376051 | 43.150704 | -112.154481 | 0 |
| 3 | 3 | 2019-01-01 00:01:16 | 3534093764340240 | fraud_Kutch, Hermiston and Farrell | gas_transport | 45.00 | Jeremy | White | M | 9443 Cynthia Court Apt. 038 | ... | 46.2306 | -112.1138 | 1939 | Patent attorney | 1967-01-12 | 6b849c168bdad6f867558c3793159a81 | 1325376076 | 47.034331 | -112.561071 | 0 |
| 4 | 4 | 2019-01-01 00:03:06 | 375534208663984 | fraud_Keeling-Crist | misc_pos | 41.96 | Tyler | Garcia | M | 408 Bradley Rest | ... | 38.4207 | -79.4629 | 99 | Dance movement psychotherapist | 1986-03-28 | a41d7549acf90789359a9aa5346dcb46 | 1325376186 | 38.674999 | -78.632459 | 0 |
5 rows × 23 columns
fraud.info()
# no missing value
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype
---  ------                 --------------    -----
 0   Unnamed: 0             1296675 non-null  int64
 1   trans_date_trans_time  1296675 non-null  object
 2   cc_num                 1296675 non-null  int64
 3   merchant               1296675 non-null  object
 4   category               1296675 non-null  object
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object
 7   last                   1296675 non-null  object
 8   gender                 1296675 non-null  object
 9   street                 1296675 non-null  object
 10  city                   1296675 non-null  object
 11  state                  1296675 non-null  object
 12  zip                    1296675 non-null  int64
 13  lat                    1296675 non-null  float64
 14  long                   1296675 non-null  float64
 15  city_pop               1296675 non-null  int64
 16  job                    1296675 non-null  object
 17  dob                    1296675 non-null  object
 18  trans_num              1296675 non-null  object
 19  unix_time              1296675 non-null  int64
 20  merch_lat              1296675 non-null  float64
 21  merch_long             1296675 non-null  float64
 22  is_fraud               1296675 non-null  int64
dtypes: float64(5), int64(6), object(12)
memory usage: 227.5+ MB
fraud.describe()
| | Unnamed: 0 | cc_num | amt | zip | lat | long | city_pop | unix_time | merch_lat | merch_long | is_fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.296675e+06 | 1.296675e+06 | 1.296675e+06 | 1.296675e+06 | 1.296675e+06 | 1.296675e+06 | 1.296675e+06 | 1.296675e+06 | 1.296675e+06 | 1.296675e+06 | 1.296675e+06 |
| mean | 6.483370e+05 | 4.171920e+17 | 7.035104e+01 | 4.880067e+04 | 3.853762e+01 | -9.022634e+01 | 8.882444e+04 | 1.349244e+09 | 3.853734e+01 | -9.022646e+01 | 5.788652e-03 |
| std | 3.743180e+05 | 1.308806e+18 | 1.603160e+02 | 2.689322e+04 | 5.075808e+00 | 1.375908e+01 | 3.019564e+05 | 1.284128e+07 | 5.109788e+00 | 1.377109e+01 | 7.586269e-02 |
| min | 0.000000e+00 | 6.041621e+10 | 1.000000e+00 | 1.257000e+03 | 2.002710e+01 | -1.656723e+02 | 2.300000e+01 | 1.325376e+09 | 1.902779e+01 | -1.666712e+02 | 0.000000e+00 |
| 25% | 3.241685e+05 | 1.800429e+14 | 9.650000e+00 | 2.623700e+04 | 3.462050e+01 | -9.679800e+01 | 7.430000e+02 | 1.338751e+09 | 3.473357e+01 | -9.689728e+01 | 0.000000e+00 |
| 50% | 6.483370e+05 | 3.521417e+15 | 4.752000e+01 | 4.817400e+04 | 3.935430e+01 | -8.747690e+01 | 2.456000e+03 | 1.349250e+09 | 3.936568e+01 | -8.743839e+01 | 0.000000e+00 |
| 75% | 9.725055e+05 | 4.642255e+15 | 8.314000e+01 | 7.204200e+04 | 4.194040e+01 | -8.015800e+01 | 2.032800e+04 | 1.359385e+09 | 4.195716e+01 | -8.023680e+01 | 0.000000e+00 |
| max | 1.296674e+06 | 4.992346e+18 | 2.894890e+04 | 9.978300e+04 | 6.669330e+01 | -6.795030e+01 | 2.906700e+06 | 1.371817e+09 | 6.751027e+01 | -6.695090e+01 | 1.000000e+00 |
fraud.shape
(1296675, 23)
# Target Variable - is_fraud
ax = sns.countplot(x = 'is_fraud', data = fraud)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.33, p.get_height()+160))
fraud.is_fraud.mean()
# The fraud rate is 0.58%. It is an imbalanced dataset.
0.005788651743883394
labels=["Genuine","Fraud"]
fraud_or_not = fraud["is_fraud"].value_counts().tolist()
values = [fraud_or_not[0], fraud_or_not[1]]
fig = px.pie(values=values, names=labels, width=700, height=400,
             color_discrete_sequence=["skyblue","purple"],
             title="Fraud vs Genuine Transactions")
fig.show()
# Plotting the heat map to find the correlation between the numeric columns
plt.figure(figsize=(15,10))
sns.heatmap(fraud.corr(numeric_only=True), annot=True, linewidths=0.5, cmap="Blues")  # numeric_only avoids errors on object columns in newer pandas
plt.show()
# merchant
fraud[fraud.is_fraud == 1].merchant.value_counts(sort =True, ascending = False).head(10).plot(kind = 'bar')
plt.title("Number of Credit Card Fraud by Merchant (Top 10)")
plt.show()
fraud[fraud.is_fraud == 1].merchant.value_counts()
# Insight: merchant could be a predictor; fraud_Rau and Sons, fraud_Cormier LLC, and fraud_Kozey-Boehm have the highest fraud activity.
fraud_Rau and Sons 49
fraud_Cormier LLC 48
fraud_Kozey-Boehm 48
fraud_Doyle Ltd 47
fraud_Vandervort-Funk 47
..
fraud_Kuphal-Toy 1
fraud_Eichmann-Kilback 1
fraud_Lynch-Mohr 1
fraud_Tillman LLC 1
fraud_Hills-Olson 1
Name: merchant, Length: 679, dtype: int64
# category:
sns.countplot(x= 'category',data = fraud[fraud.is_fraud == 1])
plt.title("Number of Credit Card Fraud by Transaction Category")
plt.xticks(rotation=80)
plt.show()
# Insight: category could be a good predictor. shopping_net and grocery_pos seem to have relatively higher fraud activity.
# Gender
sns.countplot(x= 'gender',data = fraud[fraud.is_fraud == 1])
plt.title("Number of Credit Card Fraud by Gender")
plt.show()
# Insight: hard to tell the difference from gender
#Relation between Gender and Fraud
ax=sns.histplot(x='gender',data=fraud, hue='is_fraud',stat='percent',multiple='dodge',common_norm=False)
ax.set_ylabel('Percentage')
ax.set_xlabel('Credit Card Holder Gender')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])
plt.title("Relation between Gender and Fraud")
plt.show()
# State
plt.figure(figsize=(20,8))
sns.countplot(x= 'state',data = fraud[fraud.is_fraud == 1])
plt.xticks(rotation=90)
plt.title("Number of Credit Card Fraud by State")
plt.show()
# Insight: state could be a good predictor. NY, TX, and PA report the most fraud.
#plot a geographical map of United States with different color schemes showing the intensity of fraud transactions that happened
import plotly.express as px
df = fraud.groupby('state')['is_fraud'].sum().to_frame()
df.reset_index(inplace =True)
df = df.rename(columns= {'state':'State', 'is_fraud':'Fraud Transactions'})
fig = px.choropleth(df,
                    locations='State',
                    color='Fraud Transactions',
                    locationmode='USA-states',
                    color_continuous_scale="Pinkyl",
                    scope='usa')
fig.add_scattergeo(
    locations=df['State'],
    locationmode='USA-states',
    text=df['State'],
    mode='text'
)
fig.show()
# City:
fraud[fraud.is_fraud == 1].city.value_counts(sort =True, ascending = False).head(10).plot(kind = 'bar')
plt.title("Number of Credit Card Fraud by City (Top 10)")
plt.show()
fraud[fraud.is_fraud == 1].city.value_counts()
# Insight: city could be a good predictor. Houston, Warren, and Huntsville report the most fraud.
Houston 39
Warren 33
Huntsville 29
Naples 29
Dallas 27
..
Florence 3
Kilgore 2
Phoenix 2
Phenix City 2
Denham Springs 2
Name: city, Length: 702, dtype: int64
# Job
fraud[fraud.is_fraud == 1].job.value_counts(sort =True, ascending = False).head(10).plot(kind = 'bar')
plt.title("Number of Credit Card Fraud by Job (Top 10)")
plt.show()
fraud[fraud.is_fraud == 1].job.value_counts()
# Insight: job could be a good predictor. Materials engineer, Trading standards officer, and Naval architect report the most fraud.
Materials engineer 62
Trading standards officer 56
Naval architect 53
Exhibition designer 51
Surveyor, land/geomatics 50
..
Statistician 3
Health physicist 3
Chartered loss adjuster 3
English as a second language teacher 2
Contractor 2
Name: job, Length: 443, dtype: int64
# Generate some new Categorical Variables:
# convert trans_date_trans_time from str to datetime format
fraud['trans_date'] = pd.to_datetime(fraud['trans_date_trans_time'], format = "%Y-%m-%d %H:%M:%S")
# extract the transaction month of the year
fraud['trans_month'] = fraud['trans_date'].dt.month
sns.countplot(x= 'trans_month',data = fraud[fraud.is_fraud == 1])
plt.title("Number of Credit Card Fraud by Month")
plt.show()
# Insight: trans_month could be a predictor.
# extract the transaction day of week
fraud['trans_week_day'] = fraud['trans_date'].dt.day_name()
sns.countplot(x= 'trans_week_day',data = fraud[fraud.is_fraud == 1])
plt.title("Number of Credit Card Fraud by Day of Week")
plt.show()
# Insight: trans_week_day could be a good predictor; Saturday, Sunday, and Monday report the most fraud.
# extract the transaction hour of day
fraud['trans_hour']=fraud['trans_date'].dt.hour
plt.figure(figsize=(20,8))
sns.countplot(x= 'trans_hour',data = fraud[fraud.is_fraud == 1])
plt.title("Number of Credit Card Fraud by hour")
plt.show()
# Insight: trans_hour could be a really good predictor; hours 22, 23, 0, 1, 2, and 3 report the most fraud.
#amount vs fraud
ax=sns.histplot(x='amt',data=fraud[fraud.amt<=1000],hue='is_fraud',stat='percent',multiple='dodge',common_norm=False,bins=25)
ax.set_ylabel('Percentage in Each Type')
ax.set_xlabel('Transaction Amount in USD')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])
plt.show()
# Insight: amount could be a good predictor
#age
fraud['age']=fraud['trans_date'].dt.year-pd.to_datetime(fraud['dob']).dt.year
ax=sns.kdeplot(x='age',data=fraud, hue='is_fraud', common_norm=False)
ax.set_xlabel('Credit Card Holder Age')
ax.set_ylabel('Density')
plt.xticks(np.arange(0,110,5))
plt.title('Age Distribution in Fraudulent vs Non-Fraudulent Transactions')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])
plt.show()
ax=sns.histplot(x='age',data=fraud,hue='is_fraud',stat='percent',multiple='dodge',common_norm=False,bins=25)
ax.set_ylabel('Percentage in Each Type')
ax.set_xlabel('Credit Card Holder Age')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])
plt.show()
# Insight: Age could be a predictor.
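One caveat about the age feature: subtracting the birth year from the transaction year overstates the holder's age by one whenever the transaction happens before their birthday that year. A small sketch (toy data, not the notebook's frame) of an exact correction:

```python
import pandas as pd

# Toy frame: one transaction in January for a holder born in March,
# so the birthday has not happened yet that year.
df = pd.DataFrame({
    'trans_date': pd.to_datetime(['2019-01-15']),
    'dob': pd.to_datetime(['1988-03-09']),
})

# Year subtraction, as used above: 2019 - 1988 = 31.
year_diff = df['trans_date'].dt.year - df['dob'].dt.year

# Exact age: subtract one when the transaction's (month, day) falls
# before the birthday's (month, day).
before_birthday = (
    (df['trans_date'].dt.month < df['dob'].dt.month)
    | ((df['trans_date'].dt.month == df['dob'].dt.month)
       & (df['trans_date'].dt.day < df['dob'].dt.day))
)
age = year_diff - before_birthday.astype(int)

print(int(year_diff.iloc[0]), int(age.iloc[0]))  # 31 30
```

For a tree model the one-year error is unlikely to matter much, but the correction is cheap.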
pip install h3
Collecting h3
Successfully installed h3-3.7.6
# calculate distance by using lat, long, merch_lat, merch_long
# great-circle distance between two locations (latitudes/longitudes in radians): acos(sin(lat1)*sin(lat2)+cos(lat1)*cos(lat2)*cos(lon2-lon1))*6371
import h3
fraud['distance']= fraud.apply(lambda row: h3.point_dist((row['lat'],row['long']),(row['merch_lat'],row['merch_long'])),axis=1)
ax=sns.histplot(x='distance',data=fraud,hue='is_fraud',stat='percent',multiple='dodge',common_norm=False,bins=25)
ax.set_ylabel('Percentage in Each Type')
ax.set_xlabel('Distance')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])
plt.show()
# Insight: distance could be a predictor.
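For reference, the spherical-law-of-cosines formula quoted in the comment above can be implemented directly. This sketch (the function name `great_circle_km` is ours, not part of h3) should closely match `h3.point_dist` in kilometres and avoids the h3 dependency:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Spherical-law-of-cosines distance; inputs in degrees,
    result in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    cos_angle = (np.sin(lat1) * np.sin(lat2)
                 + np.cos(lat1) * np.cos(lat2) * np.cos(lon2 - lon1))
    # Clip guards against floating-point values just outside [-1, 1].
    return EARTH_RADIUS_KM * np.arccos(np.clip(cos_angle, -1.0, 1.0))

# Sanity checks: zero for identical points; ~111 km per degree of latitude.
print(great_circle_km(36.0788, -81.1781, 36.0788, -81.1781))
print(great_circle_km(0.0, 0.0, 1.0, 0.0))
```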
fig, axes = plt.subplots(1,3,figsize=(20,5))
sns.boxplot(x =fraud.is_fraud,y=fraud[fraud.amt<=1000].amt, ax=axes[0]).set(title='Fraud vs Transaction Amount')
sns.boxplot(x =fraud.is_fraud,y=fraud.age, ax=axes[1]).set(title='Fraud vs Age')
sns.boxplot(x =fraud.is_fraud,y=fraud.distance, ax=axes[2]).set(title='Fraud vs Distance')
plt.show()
# Insight: amt could be a pretty good predictor; age could be a predictor; distance is hard to judge from the boxplot, as the distributions mostly overlap.
fraud.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 29 columns):
 #   Column                 Non-Null Count    Dtype
---  ------                 --------------    -----
 0   Unnamed: 0             1296675 non-null  int64
 1   trans_date_trans_time  1296675 non-null  object
 2   cc_num                 1296675 non-null  int64
 3   merchant               1296675 non-null  object
 4   category               1296675 non-null  object
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object
 7   last                   1296675 non-null  object
 8   gender                 1296675 non-null  object
 9   street                 1296675 non-null  object
 10  city                   1296675 non-null  object
 11  state                  1296675 non-null  object
 12  zip                    1296675 non-null  int64
 13  lat                    1296675 non-null  float64
 14  long                   1296675 non-null  float64
 15  city_pop               1296675 non-null  int64
 16  job                    1296675 non-null  object
 17  dob                    1296675 non-null  object
 18  trans_num              1296675 non-null  object
 19  unix_time              1296675 non-null  int64
 20  merch_lat              1296675 non-null  float64
 21  merch_long             1296675 non-null  float64
 22  is_fraud               1296675 non-null  int64
 23  trans_date             1296675 non-null  datetime64[ns]
 24  trans_month            1296675 non-null  int64
 25  trans_week_day         1296675 non-null  object
 26  trans_hour             1296675 non-null  int64
 27  age                    1296675 non-null  int64
 28  distance               1296675 non-null  float64
dtypes: datetime64[ns](1), float64(6), int64(9), object(13)
memory usage: 286.9+ MB
#prepare data for modeling
# convert categorical variables to numerical format
labelencoder = LabelEncoder()
fraud['merchant']=labelencoder.fit_transform(fraud['merchant'])
fraud['category']=labelencoder.fit_transform(fraud['category'])
fraud['gender']=labelencoder.fit_transform(fraud['gender'])
fraud['city']=labelencoder.fit_transform(fraud['city'])
fraud['state']=labelencoder.fit_transform(fraud['state'])
fraud['job']=labelencoder.fit_transform(fraud['job'])
fraud['trans_week_day']=labelencoder.fit_transform(fraud['trans_week_day'])
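A caveat with reusing a single `LabelEncoder` across columns, as above, is that each `fit_transform` overwrites the previous fit, so only the last column's mapping can be inverted afterwards. A sketch (toy frame, column names borrowed from the dataset) keeping one fitted encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the fraud data.
df = pd.DataFrame({'category': ['grocery_pos', 'misc_net', 'grocery_pos'],
                   'gender': ['F', 'M', 'F']})

# Keep one fitted encoder per column so every mapping stays invertible.
encoders = {}
for col in ['category', 'gender']:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df['category'].tolist())                    # [0, 1, 0]
print(list(encoders['category'].inverse_transform([0, 1])))
```

This matters later when interpreting feature importances or inspecting individual predictions: with per-column encoders you can map codes back to merchant or category names.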
feature_cols = ['merchant','category', 'gender','city', 'state', 'job','trans_month','trans_week_day','trans_hour','age','distance','amt']
X = fraud[feature_cols].copy() # Features (copy to avoid SettingWithCopyWarning when scaling below)
y = fraud['is_fraud'] # Target variable
# define the scaler
scaler = MinMaxScaler()
# fit and transform the train set
X[['age', 'distance','amt']] = scaler.fit_transform(X[['age', 'distance','amt']])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
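With only ~0.58% positives, a plain random split can leave train and test with slightly different fraud rates; passing `stratify=y` to `train_test_split` preserves the class ratio in both sides. A small sketch on synthetic data (1% positives; the names mirror the notebook's convention but the data is made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows, 1% positives.
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 990)

# stratify=y keeps the positive rate identical in train and test,
# which matters when the minority class is this rare.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(y_tr.sum(), y_te.sum())  # 7 3
```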
# Create Decision Tree classifer object
dtm = DecisionTreeClassifier(criterion="entropy", max_depth=8)
# Train Decision Tree Classifer
dtm = dtm.fit(X_train,y_train)
# Hyperparameter tuning for the Decision Tree model
train_score = []
test_score = []
max_score = 0
max_pair = (0,0)
for i in range(1,50):
    tree = DecisionTreeClassifier(max_depth=i, random_state=42)
    tree.fit(X_train, y_train)
    y_pred = tree.predict_proba(X_train)[:,1]
    y_pred_t = tree.predict_proba(X_test)[:,1]
    train_score.append(metrics.roc_auc_score(y_train, y_pred))
    test_score.append(metrics.roc_auc_score(y_test, y_pred_t))
    test_pair = (i, metrics.roc_auc_score(y_test, y_pred_t))
    if test_pair[1] > max_pair[1]:
        max_pair = test_pair
fig, ax = plt.subplots()
ax.plot(np.arange(1,50), train_score, label = "train roc_auc_score", color='purple')
ax.plot(np.arange(1,50), test_score, label = "test roc_auc_score", color='lime')
ax.legend()
print(f'Best max_depth is: {max_pair[0]} \nroc_auc_score is: {max_pair[1]}')
Best max_depth is: 8 roc_auc_score is: 0.9843941728418165
#define metrics
y_pred_proba = dtm.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()
# Calculate the G-mean
gmean = np.sqrt(tpr * (1 - fpr)) # using G-mean
# Find the optimal threshold
index = np.argmax(gmean)
thresholdOpt = round(thresholds[index], ndigits = 4)
gmeanOpt = round(gmean[index], ndigits = 4)
fprOpt = round(fpr[index], ndigits = 4)
tprOpt = round(tpr[index], ndigits = 4)
print('Best Threshold: {} with G-Mean: {}'.format(thresholdOpt, gmeanOpt))
print('FPR: {}, TPR: {}'.format(fprOpt, tprOpt))
Best Threshold: 0.0064 with G-Mean: 0.9625 FPR: 0.0286, TPR: 0.9536
#Predict the response for test dataset
# select a low threshold so the recall of the "1" (fraud) class is higher
threshold = 0.0286  # note: this value is the FPR printed above; the G-mean-optimal threshold itself was 0.0064
y_pred = (dtm.predict_proba(X_test)[:, 1] > threshold).astype('float')
dtm_matrix = metrics.confusion_matrix(y_test, y_pred)
print(dtm_matrix)
dtm_report = metrics.classification_report(y_test,y_pred)
print(dtm_report)
[[382257 4461]
[ 169 2116]]
precision recall f1-score support
0 1.00 0.99 0.99 386718
1 0.32 0.93 0.48 2285
accuracy 0.99 389003
macro avg 0.66 0.96 0.74 389003
weighted avg 1.00 0.99 0.99 389003
#Predict the response for test dataset
# select the right threshold to make sure the F1-score is higher
threshold = 0.25
y_pred = (dtm.predict_proba(X_test)[:, 1] > threshold).astype('float')
np.set_printoptions(precision=1)
dtm_matrix = metrics.confusion_matrix(y_test, y_pred)
print(dtm_matrix)
dtm_report = metrics.classification_report(y_test,y_pred,digits=4)
print(dtm_report)
[[386511 207]
[ 556 1729]]
precision recall f1-score support
0 0.9986 0.9995 0.9990 386718
1 0.8931 0.7567 0.8192 2285
accuracy 0.9980 389003
macro avg 0.9458 0.8781 0.9091 389003
weighted avg 0.9979 0.9980 0.9980 389003
resultdict = {}
for i in range(len(feature_cols)):
    resultdict[feature_cols[i]] = dtm.feature_importances_[i]
plt.bar(resultdict.keys(),resultdict.values())
plt.xticks(rotation='vertical')
plt.title('Feature Importance in Decision Tree Model')
# Most important features: amt, category, trans_hour, age, gender
plt.show()
rf = RandomForestClassifier(random_state = 42, n_estimators=1000, bootstrap = True, max_depth=10,criterion='entropy')
rf.fit(X_train, y_train)
RandomForestClassifier(criterion='entropy', max_depth=10, n_estimators=1000,
                       random_state=42)
#define metrics
y_pred_proba = rf.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()
# Calculate the G-mean
gmean = np.sqrt(tpr * (1 - fpr)) # using G-mean
# Find the optimal threshold
index = np.argmax(gmean)
thresholdOpt = round(thresholds[index], ndigits = 4)
gmeanOpt = round(gmean[index], ndigits = 4)
fprOpt = round(fpr[index], ndigits = 4)
tprOpt = round(tpr[index], ndigits = 4)
print('Best Threshold: {} with G-Mean: {}'.format(thresholdOpt, gmeanOpt))
print('FPR: {}, TPR: {}'.format(fprOpt, tprOpt))
Best Threshold: 0.0118 with G-Mean: 0.9692 FPR: 0.0305, TPR: 0.9689
# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_depth': [5, 10, 20],
    'n_estimators': [200, 500, 1000]
}
# Create a based model
rf_t = RandomForestClassifier(random_state = 42)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf_t, param_grid = param_grid,
cv = 3, n_jobs = -1, verbose = 2, scoring = 'roc_auc')
# Fit the grid search to the data.
# grid_search.fit(X_train, y_train)
# grid_search.best_params_
# could take a long time (>100mins)
y_pred2 = rf.predict(X_test)
rf_matrix = metrics.confusion_matrix(y_test, y_pred2)
print(rf_matrix)
rf_report = metrics.classification_report(y_test,y_pred2)
print(rf_report)
[[386694 24]
[ 893 1392]]
precision recall f1-score support
0 1.00 1.00 1.00 386718
1 0.98 0.61 0.75 2285
accuracy 1.00 389003
macro avg 0.99 0.80 0.88 389003
weighted avg 1.00 1.00 1.00 389003
#Predict the response for test dataset
# select the right threshold to make sure the F1-score is higher
threshold = 0.25
y_pred2 = (rf.predict_proba(X_test)[:, 1] > threshold).astype('float')
rf_matrix = metrics.confusion_matrix(y_test, y_pred2)
print(rf_matrix)
rf_report = metrics.classification_report(y_test,y_pred2,digits=4)
print(rf_report)
[[386502 216]
[ 492 1793]]
precision recall f1-score support
0 0.9987 0.9994 0.9991 386718
1 0.8925 0.7847 0.8351 2285
accuracy 0.9982 389003
macro avg 0.9456 0.8921 0.9171 389003
weighted avg 0.9981 0.9982 0.9981 389003
resultdict = {}
for i in range(len(feature_cols)):
    resultdict[feature_cols[i]] = rf.feature_importances_[i]
plt.bar(resultdict.keys(),resultdict.values())
plt.xticks(rotation='vertical')
plt.title('Feature Importance in Random Forest Model')
# Most important features: amt, trans_hour, category, age
plt.show()
#set up plotting area
plt.figure(0).clf()
#fit decisiom tree model and plot ROC curve
y_pred = dtm.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Decision Tree, AUC="+str(auc))
#fit random forest model and plot ROC curve
y_pred = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Random Forest, AUC="+str(auc))
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
plt.title(" AUC Comparison")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
#add legend
plt.legend()
plt.show()
Model Comparison
| Metric/Model | Decision Tree | Random Forest |
|---|---|---|
| Accuracy | 0.9980 | 0.9987 |
| Precision | 0.8926 | 0.8925 |
| Recall | 0.7567 | 0.7847 |
| F1-Score | 0.8190 | 0.8351 |
| AUC | 0.9842 | 0.9937 |
Based on these metrics, the random forest model has better performance, so we will choose the Random Forest Model.
Imbalanced data refers to a situation, primarily in classification machine learning, where one target class accounts for a much smaller share of observations than the other. Imbalanced datasets are those with a severe skew in the class distribution, such as 1:100 or 1:1000 minority-to-majority examples. There are several approaches to handling class imbalance before classification, such as:
random oversampling of the minority class (shown below) and class weights in the models
This section shows how to deal with an imbalanced dataset. We will use only 10% of the whole dataset to speed up modeling and hyperparameter tuning.
data = fraud.sample(frac=0.1, random_state = 42)
data.head()
| | Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | city | state | zip | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud | trans_date | trans_month | trans_week_day | trans_hour | age | distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1045211 | 1045211 | 2020-03-09 15:09:26 | 577588686219 | 629 | 9 | 194.51 | James | Strickland | 1 | 25454 Leonard Lake | 768 | 38 | 15686 | 40.6153 | -79.4545 | 972 | 378 | 1997-10-23 | fff87d4340ef756a592eac652493cf6b | 1362841766 | 40.420453 | -78.865012 | 0 | 2020-03-09 15:09:26 | 3 | 1 | 15 | 23 | 54.336180 |
| 547406 | 547406 | 2019-08-22 15:49:01 | 30376238035123 | 180 | 5 | 52.32 | Cynthia | Davis | 0 | 7177 Steven Forges | 750 | 37 | 97476 | 42.8250 | -124.4409 | 217 | 400 | 1928-10-01 | d0ad335af432f35578eea01d639b3621 | 1345650541 | 42.758860 | -123.636337 | 0 | 2019-08-22 15:49:01 | 8 | 4 | 15 | 91 | 66.060940 |
| 110142 | 110142 | 2019-03-04 01:34:16 | 4658490815480264 | 429 | 12 | 6.53 | Tara | Richards | 0 | 4879 Cristina Station | 400 | 38 | 15449 | 39.9636 | -79.7853 | 184 | 444 | 1945-11-04 | 87f26e3ea33f4ff4c7a8bad2c7f48686 | 1330824856 | 40.475159 | -78.898190 | 0 | 2019-03-04 01:34:16 | 3 | 1 | 1 | 74 | 94.386151 |
| 1285953 | 1285953 | 2020-06-16 20:04:38 | 3514897282719543 | 187 | 6 | 7.33 | Steven | Faulkner | 1 | 841 Cheryl Centers Suite 115 | 262 | 34 | 14425 | 42.9580 | -77.3083 | 10717 | 115 | 1952-10-13 | 9c34015321c0fa2ae6fd20f9359d1d3e | 1371413078 | 43.767506 | -76.542384 | 0 | 2020-06-16 20:04:38 | 6 | 5 | 20 | 68 | 109.251413 |
| 271705 | 271705 | 2019-05-14 05:54:48 | 6011381817520024 | 92 | 2 | 64.29 | Kristen | Allen | 0 | 8619 Lisa Manors Apt. 871 | 419 | 50 | 82221 | 41.6423 | -104.1974 | 635 | 358 | 1973-07-13 | 198437c05676f485e9be04449c664475 | 1336974888 | 41.040392 | -104.092324 | 0 | 2019-05-14 05:54:48 | 5 | 5 | 5 | 46 | 67.501592 |
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 129668 entries, 1045211 to 879092
Data columns (total 29 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   Unnamed: 0             129668 non-null  int64
 1   trans_date_trans_time  129668 non-null  object
 2   cc_num                 129668 non-null  int64
 3   merchant               129668 non-null  int64
 4   category               129668 non-null  int64
 5   amt                    129668 non-null  float64
 6   first                  129668 non-null  object
 7   last                   129668 non-null  object
 8   gender                 129668 non-null  int64
 9   street                 129668 non-null  object
 10  city                   129668 non-null  int64
 11  state                  129668 non-null  int64
 12  zip                    129668 non-null  int64
 13  lat                    129668 non-null  float64
 14  long                   129668 non-null  float64
 15  city_pop               129668 non-null  int64
 16  job                    129668 non-null  int64
 17  dob                    129668 non-null  object
 18  trans_num              129668 non-null  object
 19  unix_time              129668 non-null  int64
 20  merch_lat              129668 non-null  float64
 21  merch_long             129668 non-null  float64
 22  is_fraud               129668 non-null  int64
 23  trans_date             129668 non-null  datetime64[ns]
 24  trans_month            129668 non-null  int64
 25  trans_week_day         129668 non-null  int64
 26  trans_hour             129668 non-null  int64
 27  age                    129668 non-null  int64
 28  distance               129668 non-null  float64
dtypes: datetime64[ns](1), float64(6), int64(16), object(6)
memory usage: 29.7+ MB
data.is_fraud.mean()
## fraud: 0.6%
0.005961378289169263
feature_cols = ['merchant','category', 'gender','city', 'state', 'job','trans_month','trans_week_day','trans_hour','age','distance','amt']
X = data[feature_cols].copy() # Features (copy to avoid SettingWithCopyWarning when scaling below)
y = data['is_fraud'] # Target variable
# define the scaler
scaler = MinMaxScaler()
# fit and transform the train set
X[['age', 'distance','amt']] = scaler.fit_transform(X[['age', 'distance','amt']])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_depth': [4, 6, 8, 10],
    'n_estimators': [50, 100, 200]
}
# Create a based model
rf = RandomForestClassifier(random_state = 42)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid,
cv = 3, n_jobs = -1, verbose = 2, scoring = 'roc_auc')
# Fit the grid search to the data.
grid_search.fit(X_train, y_train)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 200}
rf = RandomForestClassifier(random_state = 42, n_estimators=200, bootstrap = True, max_depth=10,criterion='entropy')
rf.fit(X_train, y_train)
RandomForestClassifier(criterion='entropy', max_depth=10, n_estimators=200,
                       random_state=42)
# focus more on "recall"
y_pred3 = rf.predict(X_test)
rf_matrix = metrics.confusion_matrix(y_test, y_pred3)
print(rf_matrix)
rf_report = metrics.classification_report(y_test, y_pred3, digits=4)
print(rf_report)
[[38520 139]
[ 47 195]]
precision recall f1-score support
0 0.9988 0.9964 0.9976 38659
1 0.5838 0.8058 0.6771 242
accuracy 0.9952 38901
macro avg 0.7913 0.9011 0.8373 38901
weighted avg 0.9962 0.9952 0.9956 38901
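Besides resampling, the class weights mentioned above can be applied directly inside the model. A minimal sketch on synthetic imbalanced data (not the fraud set), using scikit-learn's built-in `class_weight='balanced'` option:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data: roughly 1% positives.
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# class_weight='balanced' weights each class inversely to its frequency,
# so mistakes on the rare class cost more during training; this avoids
# resampling entirely.
rf_w = RandomForestClassifier(n_estimators=100, max_depth=10,
                              class_weight='balanced', random_state=42)
rf_w.fit(X, y)
print(round(rf_w.score(X, y), 3))
```

Compared with oversampling, class weighting changes no data and adds no training cost, though the two approaches can give different recall/precision balances and are worth comparing on the same validation split.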
from imblearn.pipeline import Pipeline, make_pipeline
# Random Oversampling Imbalanced Datasets
from imblearn.over_sampling import RandomOverSampler
# define oversampling strategy
ros = RandomOverSampler(random_state=42)
# fit and apply the transform
X_over, y_over = ros.fit_resample(X_train, y_train)
print('Genuine:', y_over.value_counts()[0], '/', round(y_over.value_counts()[0]/len(y_over) * 100,2), '% of the dataset')
print('Frauds:', y_over.value_counts()[1], '/',round(y_over.value_counts()[1]/len(y_over) * 100,2), '% of the dataset')
Genuine: 90236 / 50.0 % of the dataset Frauds: 90236 / 50.0 % of the dataset
# Fit the grid search to the data.
grid_search.fit(X_over, y_over)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 100}
rf_over = RandomForestClassifier(random_state=42, n_estimators=100, bootstrap=True, max_depth=10, criterion='entropy')
rf_over.fit(X_over, y_over)
RandomForestClassifier(criterion='entropy', max_depth=10, random_state=42)
# focus on "recall"
y_pred_over = rf_over.predict(X_test)
rf_over_matrix = metrics.confusion_matrix(y_test, y_pred_over)
print(rf_over_matrix)
rf_over_report = metrics.classification_report(y_test,y_pred_over,digits=4)
print(rf_over_report)
[[38509 150]
[ 48 194]]
precision recall f1-score support
0 0.9988 0.9961 0.9974 38659
1 0.5640 0.8017 0.6621 242
accuracy 0.9949 38901
macro avg 0.7814 0.8989 0.8298 38901
weighted avg 0.9961 0.9949 0.9953 38901
from imblearn.under_sampling import RandomUnderSampler
# define undersampling strategy
rus = RandomUnderSampler(random_state=42)
# fit and apply the transform
X_under, y_under = rus.fit_resample(X_train, y_train)
print('Genuine:', y_under.value_counts()[0], '/', round(y_under.value_counts()[0]/len(y_under) * 100,2), '% of the dataset')
print('Frauds:', y_under.value_counts()[1], '/',round(y_under.value_counts()[1]/len(y_under) * 100,2), '% of the dataset')
Genuine: 531 / 50.0 % of the dataset
Frauds: 531 / 50.0 % of the dataset
# Fit the grid search to the data.
grid_search.fit(X_under, y_under)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
{'bootstrap': True, 'max_depth': 8, 'n_estimators': 200}
rf_under = RandomForestClassifier(random_state=42, n_estimators=200, bootstrap=True, max_depth=8, criterion='entropy')
rf_under.fit(X_under, y_under)
RandomForestClassifier(criterion='entropy', max_depth=8, n_estimators=200, random_state=42)
# focus on "recall"
y_pred_under = rf_under.predict(X_test)
rf_under_matrix = metrics.confusion_matrix(y_test, y_pred_under)
print(rf_under_matrix)
rf_under_report = metrics.classification_report(y_test,y_pred_under,digits=4)
print(rf_under_report)
[[36258 2401]
[ 15 227]]
precision recall f1-score support
0 0.9996 0.9379 0.9678 38659
1 0.0864 0.9380 0.1582 242
accuracy 0.9379 38901
macro avg 0.5430 0.9380 0.5630 38901
weighted avg 0.9939 0.9379 0.9627 38901
# Since the recall is high, we could change the threshold to balance precision and recall.
threshold = 0.7
y_pred_under2 = (rf_under.predict_proba(X_test)[:, 1] > threshold).astype('float')
rf_under_matrix2 = metrics.confusion_matrix(y_test, y_pred_under2)
print(rf_under_matrix2)
rf_under_report2 = metrics.classification_report(y_test,y_pred_under2,digits=4)
print(rf_under_report2)
[[38134 525]
[ 41 201]]
precision recall f1-score support
0 0.9989 0.9864 0.9926 38659
1 0.2769 0.8306 0.4153 242
accuracy 0.9855 38901
macro avg 0.6379 0.9085 0.7040 38901
weighted avg 0.9944 0.9855 0.9890 38901
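The threshold-moving step above can be generalized: instead of hand-picking a cutoff such as 0.7, sweep a grid of candidate thresholds on held-out data and keep the one that maximizes the metric you care about. A minimal sketch on synthetic data standing in for the fraud features (`best_threshold` is a hypothetical helper, not part of the notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def best_threshold(model, X_val, y_val, thresholds=np.linspace(0.05, 0.95, 19)):
    """Return the probability cutoff (and its F1) that maximizes F1 on a validation set."""
    probas = model.predict_proba(X_val)[:, 1]
    scores = [f1_score(y_val, (probas > t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# toy imbalanced data: ~5% positives, a (much milder) stand-in for fraud
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
t, f1 = best_threshold(clf, X_val, y_val)
print(round(t, 2), round(f1, 4))
```

Raising the threshold trades recall for precision; lowering it does the opposite, which is exactly the lever used in the cell above.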
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy = 'minority', random_state=42)
X_smote, y_smote = sm.fit_resample(X_train, y_train)
print('Genuine:', y_smote.value_counts()[0], '/', round(y_smote.value_counts()[0]/len(y_smote) * 100,2), '% of the dataset')
print('Frauds:', y_smote.value_counts()[1], '/',round(y_smote.value_counts()[1]/len(y_smote) * 100,2), '% of the dataset')
Genuine: 90236 / 50.0 % of the dataset
Frauds: 90236 / 50.0 % of the dataset
# Fit the grid search to the data.
grid_search.fit(X_smote, y_smote)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 100}
rf_smote = RandomForestClassifier(random_state=42, n_estimators=100, bootstrap=True, max_depth=10, criterion='entropy')
rf_smote.fit(X_smote, y_smote)
RandomForestClassifier(criterion='entropy', max_depth=10, random_state=42)
y_pred_smote = rf_smote.predict(X_test)
rf_smote_matrix = metrics.confusion_matrix(y_test, y_pred_smote)
print(rf_smote_matrix)
rf_smote_report = metrics.classification_report(y_test,y_pred_smote,digits=4)
print(rf_smote_report)
[[37893 766]
[ 59 183]]
precision recall f1-score support
0 0.9984 0.9802 0.9892 38659
1 0.1928 0.7562 0.3073 242
accuracy 0.9788 38901
macro avg 0.5956 0.8682 0.6483 38901
weighted avg 0.9934 0.9788 0.9850 38901
from imblearn.under_sampling import TomekLinks
# define the undersampling method
#tomekU = TomekLinks(sampling_strategy='auto', n_jobs=-1)
tomekU = TomekLinks()
# fit and apply the transform
X_underT, y_underT = tomekU.fit_resample(X_train, y_train)
print('Genuine:', y_underT.value_counts()[0], '/', round(y_underT.value_counts()[0]/len(y_underT) * 100,2), '% of the dataset')
print('Frauds:', y_underT.value_counts()[1], '/',round(y_underT.value_counts()[1]/len(y_underT) * 100,2), '% of the dataset')
Genuine: 89980 / 99.41 % of the dataset
Frauds: 531 / 0.59 % of the dataset
# Fit the grid search to the data.
grid_search.fit(X_underT, y_underT)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 100}
rf_underT = RandomForestClassifier(random_state=42, n_estimators=100, bootstrap=True, max_depth=10, criterion='entropy')
rf_underT.fit(X_underT, y_underT)
RandomForestClassifier(criterion='entropy', max_depth=10, random_state=42)
y_pred_underT = rf_underT.predict(X_test)
rf_underT_matrix = metrics.confusion_matrix(y_test, y_pred_underT)
print(rf_underT_matrix)
rf_underT_report = metrics.classification_report(y_test,y_pred_underT,digits=4)
print(rf_underT_report)
[[38654 5]
[ 109 133]]
precision recall f1-score support
0 0.9972 0.9999 0.9985 38659
1 0.9638 0.5496 0.7000 242
accuracy 0.9971 38901
macro avg 0.9805 0.7747 0.8493 38901
weighted avg 0.9970 0.9971 0.9967 38901
# Since the precision is high, we could change the threshold to balance precision and recall.
threshold = 0.1
y_pred_underT2 = (rf_underT.predict_proba(X_test)[:, 1] > threshold).astype('float')
rf_underT_matrix2 = metrics.confusion_matrix(y_test, y_pred_underT2)
print(rf_underT_matrix2)
rf_underT_report2 = metrics.classification_report(y_test,y_pred_underT2,digits=4)
print(rf_underT_report2)
[[38460 199]
[ 41 201]]
precision recall f1-score support
0 0.9989 0.9949 0.9969 38659
1 0.5025 0.8306 0.6262 242
accuracy 0.9938 38901
macro avg 0.7507 0.9127 0.8115 38901
weighted avg 0.9958 0.9938 0.9946 38901
from imblearn.combine import SMOTETomek
st = SMOTETomek()
# fit and apply the transform
X_st, y_st = st.fit_resample(X_train, y_train)
print('Genuine:', y_st.value_counts()[0], '/', round(y_st.value_counts()[0]/len(y_st) * 100,2), '% of the dataset')
print('Frauds:', y_st.value_counts()[1], '/',round(y_st.value_counts()[1]/len(y_st) * 100,2), '% of the dataset')
Genuine: 90233 / 50.0 % of the dataset
Frauds: 90233 / 50.0 % of the dataset
# Fit the grid search to the data.
grid_search.fit(X_st, y_st)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 100}
rf_st = RandomForestClassifier(random_state=42, n_estimators=100, bootstrap=True, max_depth=10, criterion='entropy')
rf_st.fit(X_st, y_st)
RandomForestClassifier(criterion='entropy', max_depth=10, random_state=42)
y_pred_st = rf_st.predict(X_test)
rf_st_matrix = metrics.confusion_matrix(y_test, y_pred_st)
print(rf_st_matrix)
rf_st_report = metrics.classification_report(y_test,y_pred_st,digits=4)
print(rf_st_report)
[[37837 822]
[ 56 186]]
precision recall f1-score support
0 0.9985 0.9787 0.9885 38659
1 0.1845 0.7686 0.2976 242
accuracy 0.9774 38901
macro avg 0.5915 0.8737 0.6431 38901
weighted avg 0.9935 0.9774 0.9842 38901
Most machine learning models provide a parameter called class_weight. For example, in a random forest classifier we can assign a higher weight to the minority class by passing class_weight a dictionary.
Without weights set, the model treats every point as equally important. Weights scale the loss function: as the model trains, each point's error is multiplied by the weight of its class. The estimator therefore works harder to minimize error on the more heavily weighted class, because those points contribute more to the loss and send a stronger signal.
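With class_weight="balanced", scikit-learn uses the heuristic n_samples / (n_classes * count(class)), so the rarer the class, the larger its weight. This can be checked with `compute_class_weight` on illustrative labels (not the notebook's data), or overridden with an explicit dictionary:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# illustrative imbalanced labels: 95 genuine (0), 5 fraud (1)
y_demo = np.array([0] * 95 + [1] * 5)
w = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_demo)
print(w)  # 100/(2*95) for class 0, 100/(2*5) = 10.0 for class 1

# an explicit dictionary gives direct control over the tradeoff, e.g.
# RandomForestClassifier(class_weight={0: 1, 1: 20})
```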
# If you choose class_weight = "balanced",
# the classes will be weighted inversely proportional to how frequently they appear in the data.
rfb = RandomForestClassifier(random_state=42, class_weight="balanced")
grid_search = GridSearchCV(estimator = rfb, param_grid = param_grid,
cv = 3, n_jobs = -1, verbose = 2, scoring = 'roc_auc')
# Fit the grid search to the data.
grid_search.fit(X_train, y_train)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 200}
rfb = RandomForestClassifier(random_state=42, class_weight="balanced", n_estimators=200, bootstrap=True, max_depth=10, criterion='entropy')
rfb.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', criterion='entropy', max_depth=10, n_estimators=200, random_state=42)
y_pred_rfb = rfb.predict(X_test)
rfb_matrix = metrics.confusion_matrix(y_test, y_pred_rfb)
print(rfb_matrix)
rfb_report = metrics.classification_report(y_test,y_pred_rfb,digits=4)
print(rfb_report)
[[38588 71]
[ 59 183]]
precision recall f1-score support
0 0.9985 0.9982 0.9983 38659
1 0.7205 0.7562 0.7379 242
accuracy 0.9967 38901
macro avg 0.8595 0.8772 0.8681 38901
weighted avg 0.9967 0.9967 0.9967 38901
#set up plotting area
plt.figure(0).clf()
#plot ROC curve for the base random forest model
y_pred = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Base, AUC="+str(auc))
#plot ROC curve for the oversampling model
y_pred = rf_over.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Oversampling, AUC="+str(auc))
y_pred = rf_under.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Undersampling, AUC="+str(auc))
y_pred = rf_smote.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="SMOTE, AUC="+str(auc))
y_pred = rf_underT.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Tomek_Links, AUC="+str(auc))
y_pred = rf_st.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="SMOTE_Tomek, AUC="+str(auc))
y_pred = rfb.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Class_Weights, AUC="+str(auc))
plt.plot([0, 1], [0, 1], color='purple', linestyle='--')
plt.title(" AUC Comparison")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
#add legend
plt.legend()
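The repeated fpr/tpr/AUC blocks above can be collapsed into a single loop over the fitted models. A self-contained sketch, with two toy models standing in for the seven trained here:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch also runs in scripts
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def plot_roc_curves(named_models, X_test, y_test):
    """Plot one ROC curve per fitted model, labelled with its AUC."""
    for name, model in named_models.items():
        proba = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = metrics.roc_curve(y_test, proba)
        auc = round(metrics.roc_auc_score(y_test, proba), 4)
        plt.plot(fpr, tpr, label=f"{name}, AUC={auc}")
    plt.plot([0, 1], [0, 1], color='purple', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('AUC Comparison')
    plt.legend()

# toy stand-ins for the resampled/weighted models
X, y = make_classification(n_samples=500, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
shallow = RandomForestClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)
deep = RandomForestClassifier(max_depth=10, random_state=42).fit(X_tr, y_tr)
plot_roc_curves({'shallow': shallow, 'deep': deep}, X_te, y_te)
```

In the notebook the same helper would take a dictionary such as `{'Base': rf, 'Oversampling': rf_over}` extended with the remaining five models.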
Model Comparison
| Metric/Method | Base | Oversampling | Undersampling | SMOTE | Tomek_Links | SMOTE_Tomek | Class_Weights |
|---|---|---|---|---|---|---|
| Accuracy | 0.9952 | 0.9949 | 0.9379 | 0.9788 | 0.9971 | 0.9774 | 0.9967 |
| Precision | 0.5838 | 0.5640 | 0.0864 | 0.1928 | 0.9638 | 0.1845 | 0.7205 |
| Recall | 0.8058 | 0.8017 | 0.9380 | 0.7562 | 0.5496 | 0.7686 | 0.7562 |
| F1-Score | 0.6771 | 0.6621 | 0.1582 | 0.3073 | 0.7000 | 0.2976 | 0.7379 |
| AUC | 0.9873 | 0.9883 | 0.9818 | 0.9467 | 0.9497 | 0.9474 | 0.9888 |
Based on these metrics, the Class Weights method has the best overall performance, combining the highest F1-score with the highest AUC. We will therefore choose the Random Forest model with class weights.